The dataset has been sourced from: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine/
The quality of wine depends on many factors. The two datasets used here cover the red and white variants of the Portuguese “Vinho Verde” wine. Only physicochemical (input) and sensory (output) variables are available. The authors chose 11 attributes they believed could be significant contributors to wine quality. A total of 6,497 observations were collected (1,599 red and 4,898 white examples) between May 2004 and February 2007.
These datasets are publicly available for research purposes only, at https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Improving the quality and marketing of vinho verde has always been the motto of the CVRVV, the official certification entity. A computerized system (iLab) records the data: iLab automatically manages wine-sample testing, from the producer's request through laboratory and sensory analysis. Each entry denotes a given test (analytical or sensory), and the final database was exported into a single sheet (.csv).
Variable name          Description                              Measurement type     Role
fixed acidity          level of low-volatility organic acids    numeric              predictor
volatile acidity       level of short-chain organic acids       numeric              predictor
citric acid            level of citric acid                     numeric              predictor
chlorides              level of chlorides                       numeric              predictor
free sulfur dioxide    level of free sulfur dioxide             numeric              predictor
total sulfur dioxide   level of total sulfur dioxide            numeric              predictor
density                density of the wine                      numeric              predictor
pH                     pH of the wine                           numeric              predictor
sulphates              level of sulphates                       numeric              predictor
alcohol                level of alcohol                         numeric              predictor
residual sugar         level of residual sugars                 numeric              predictor
quality                quality of the wine                      integer              outcome
Color                  color of the wine                        categorical          outcome/predictor
a) Is there a significant association between the sulphates in the wine and the quality of the wine?
b) Is there a significant association between the level of alcohol and the quality of the wine?
c) Is there a significant association between the volatile acidity and the quality of the wine?
We use a Bayesian Normal regression model, Yi | β0, β1, σ ∼ N(μi, σ^2). The outcome variable here is ‘quality’, which is discrete but treated as approximately continuous.
- Independence: the authors did not provide enough detail on the wine-making process from farm to bottle, so we assume each observed quality value is independent of the other observed quality values, for any predictor taken into account.
- Linearity: the outcome quality can be written as a linear function of each predictor, μi = β0 + β1(X):
Model 1: μi = β0 + β1(sulphates)
Model 2: μi = β0 + β1(alcohol)
Model 3: μi = β0 + β1(volatile acidity)
- Normality: the observed quality values vary normally around the mean μi with a constant standard deviation σ, for any value of the predictor X.
Because we need Markov chains to simulate the posterior of β0, β1, and σ, we fit these Bayesian generalized linear models with stan_glm() from rstanarm.
For each of the model parameters we take the same prior PDFs: intercept β0 ~ N(6, 1.5^2), slope regression parameter β1 ~ N(0, 1^2), and standard deviation parameter σ ~ Exp(1).
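In compact notation, each of the three simple models shares the same structure:

```latex
Y_i \mid \beta_0, \beta_1, \sigma \;\sim\; N(\mu_i, \sigma^2),
\qquad \mu_i = \beta_0 + \beta_1 X_i
```

```latex
\beta_0 \sim N(6,\, 1.5^2), \qquad
\beta_1 \sim N(0,\, 1^2), \qquad
\sigma \sim \mathrm{Exp}(1)
```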
Cortez et al. (2009) discuss that wine quality depends strongly on the sulphates present, since sulphates are important for improving the wine's aroma. To test the basis of this claim, we frame the null hypothesis that there is no association between wine quality and sulphates.
Cortez et al. (2009) discuss that wine quality increases as the level of alcohol increases. To test the basis of this claim, we frame the null hypothesis that there is no association between wine quality and the level of alcohol.
Cortez et al. (2009) discuss that volatile acidity has a negative impact on wine quality. To test the basis of this claim, we frame the null hypothesis that there is no association between wine quality and volatile acidity.
Apart from the intercept β0 ~ N(6, 1.5^2), we are dealing with weak priors here, so we use rstanarm's autoscale option to tune the prior PDFs' scale parameters: prior = normal(0, 1, autoscale = TRUE)
library(bayesrules)
library(tidybayes)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.7
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(bayesplot)
## This is bayesplot version 1.9.0
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.5, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
##
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
##
## extract
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.3
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores())
##
## Attaching package: 'rstanarm'
## The following object is masked from 'package:rstan':
##
## loo
library(broom.mixed)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
project1<- read.csv(file = "~/Downloads/untitled/untitled folder/5800/White_Red_Wine_quality.csv",header = TRUE)
project1<-na.omit(project1)
head(project1)
project1$Color<-as.factor(project1$Color)
summary(project1)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality Color
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000 Red :1599
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000 White:4898
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
We exclude the Color of the wine, a categorical variable that I created in the csv file and that was not in the original datasets: I merged the two datasets (one on red wine, one on white wine, from the same study) and, for convenience, added a Color column for future analysis purposes.
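The merge itself happened outside this script; a minimal sketch of the equivalent R step, assuming the two original UCI files (winequality-red.csv and winequality-white.csv, which are semicolon-delimited), would be:

```r
# Sketch only: the file names/paths are assumptions, not the actual preprocessing script
red   <- read.csv("winequality-red.csv",   sep = ";")
white <- read.csv("winequality-white.csv", sep = ";")
red$Color   <- "Red"
white$Color <- "White"
# 1599 red + 4898 white = 6497 rows, matching the summary above
combined <- rbind(red, white)
```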
ggpairs(data=project1,
columns=c("quality","pH","alcohol","density","citric.acid","residual.sugar","total.sulfur.dioxide","volatile.acidity","fixed.acidity","chlorides","free.sulfur.dioxide","sulphates"))
With this many variables, the bivariate matrix is hard to read, but for quality we can see that density, alcohol, and volatile acidity appear correlated with the quality of the wine. This is not clear enough to draw a conclusion from, so we need further analysis to understand the relation between the quality of the wine and the remaining predictors.
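As a more readable complement to the full matrix, we could pull out only the correlations with quality (a sketch using base R on the already-loaded project1):

```r
# Correlation of each numeric predictor with quality, strongest first
num_vars <- project1[sapply(project1, is.numeric)]
sort(cor(num_vars)[, "quality"], decreasing = TRUE)
```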
We are going to use the Bayesian Normal regression model with the prior intercept β0 ~ N(6, 1.5^2): from the summary we can see that the minimum quality is 3 and the maximum is 9, so we take the mean as 6 and the standard deviation as (6 − 3)/2 = (9 − 6)/2 = 1.5. As there is not enough information on the predictors, we assume they are independent of each other; for the same reason we treat the slope priors as weak, with mean 0 and standard deviation 1, giving β1 ~ N(0, 1^2), and we use the standard deviation parameter σ ~ Exp(1). We use the autoscale feature of stan_glm() to tune these weak priors to the scales of the data.
Wine_model_1 <- stan_glm(
quality ~pH+alcohol+density+citric.acid+ residual.sugar+total.sulfur.dioxide+volatile.acidity+fixed.acidity+chlorides+free.sulfur.dioxide+sulphates,
data = project1, family = gaussian,
prior_intercept = normal(6, 1.5, autoscale = TRUE),
prior = normal(0, 1, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 2002)
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 8.5e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.85 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 1.20182 seconds (Warm-up)
## Chain 1: 3.85146 seconds (Sampling)
## Chain 1: 5.05327 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 2.1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 2: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 1.16989 seconds (Warm-up)
## Chain 2: 4.60152 seconds (Sampling)
## Chain 2: 5.77141 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 4.6e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.46 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 3: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 1.3314 seconds (Warm-up)
## Chain 3: 4.25 seconds (Sampling)
## Chain 3: 5.58139 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 2e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 4: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 1.18845 seconds (Warm-up)
## Chain 4: 3.37956 seconds (Sampling)
## Chain 4: 4.56801 seconds (Total)
## Chain 4:
summary(Wine_model_1 )
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: quality ~ pH + alcohol + density + citric.acid + residual.sugar +
## total.sulfur.dioxide + volatile.acidity + fixed.acidity +
## chlorides + free.sulfur.dioxide + sulphates
## algorithm: sampling
## sample: 20000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 6497
## predictors: 12
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 55.6 12.0 40.3 55.8 71.0
## pH 0.4 0.1 0.3 0.4 0.6
## alcohol 0.3 0.0 0.2 0.3 0.3
## density -54.8 12.3 -70.5 -55.0 -39.2
## citric.acid -0.1 0.1 -0.2 -0.1 0.0
## residual.sugar 0.0 0.0 0.0 0.0 0.1
## total.sulfur.dioxide 0.0 0.0 0.0 0.0 0.0
## volatile.acidity -1.3 0.1 -1.4 -1.3 -1.2
## fixed.acidity 0.1 0.0 0.0 0.1 0.1
## chlorides -0.5 0.3 -0.9 -0.5 0.0
## free.sulfur.dioxide 0.0 0.0 0.0 0.0 0.0
## sulphates 0.8 0.1 0.7 0.8 0.9
## sigma 0.7 0.0 0.7 0.7 0.7
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 5.8 0.0 5.8 5.8 5.8
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.1 1.0 7986
## pH 0.0 1.0 10907
## alcohol 0.0 1.0 9606
## density 0.1 1.0 7947
## citric.acid 0.0 1.0 18076
## residual.sugar 0.0 1.0 8795
## total.sulfur.dioxide 0.0 1.0 15354
## volatile.acidity 0.0 1.0 16233
## fixed.acidity 0.0 1.0 9540
## chlorides 0.0 1.0 18943
## free.sulfur.dioxide 0.0 1.0 17804
## sulphates 0.0 1.0 17428
## sigma 0.0 1.0 21332
## mean_PPD 0.0 1.0 20585
## log-posterior 0.0 1.0 8690
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
The mean posterior predictive value (mean_PPD), a rough check of how well the model fits the data, matches the sample mean of quality (5.8). Looking at the MCMC diagnostics, the mcse is essentially zero for all parameters and Rhat ≈ 1, indicating the chains have converged and the posterior has been sampled well, so the results can be considered reliable. The effective sample sizes are large (roughly 8,000 to 21,000), indicating we have explored the probability space enough to have a good representation.
prior_summary(Wine_model_1)
## Priors for model 'Wine_model_1'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 6, scale = 1.5)
## Adjusted prior:
## ~ normal(location = 6, scale = 1.3)
##
## Coefficients
## Specified prior:
## ~ normal(location = [0,0,0,...], scale = [1,1,1,...])
## Adjusted prior:
## ~ normal(location = [0,0,0,...], scale = [ 5.43, 0.73,291.21,...])
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 1.1)
## ------
## See help('prior_summary.stanreg') for more details
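The adjusted scales above can be reproduced by hand. For a Gaussian model, rstanarm's autoscaling multiplies the intercept scale by sd(y) and each coefficient scale by sd(y)/sd(x) (a sketch based on rstanarm's documented autoscaling behaviour; the exact rescaling rule is the assumption here):

```r
sd_y <- sd(project1$quality)
c(intercept = 1.5 * sd_y,                        # ~1.3, as reported above
  alcohol   = 1 * sd_y / sd(project1$alcohol),   # ~0.73
  density   = 1 * sd_y / sd(project1$density))   # ~291
```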
mcmc_trace(Wine_model_1, size = 0.1)
Applying mcmc_trace() to the model, we can see that all of the chains look fairly stable.
mcmc_dens_overlay(Wine_model_1)
Applying mcmc_dens_overlay(), we can see from the plots that even across the four parallel chains the posterior densities are close to each other.
mcmc_acf(Wine_model_1)
After applying mcmc_acf(), the graphs show the degree to which successive simulation values are related to each other. Autocorrelation is high at small lags but drops quickly toward zero at larger lags, which suggests the results are credible. With this many variables, however, the panels are somewhat cramped.
neff_ratio(Wine_model_1)
## (Intercept) pH alcohol
## 0.39930 0.54535 0.48030
## density citric.acid residual.sugar
## 0.39735 0.90380 0.43975
## total.sulfur.dioxide volatile.acidity fixed.acidity
## 0.76770 0.81165 0.47700
## chlorides free.sulfur.dioxide sulphates
## 0.94715 0.89020 0.87140
## sigma
## 1.06660
rhat(Wine_model_1)
## (Intercept) pH alcohol
## 1.0001053 1.0000154 0.9999997
## density citric.acid residual.sugar
## 1.0001081 0.9999115 1.0000697
## total.sulfur.dioxide volatile.acidity fixed.acidity
## 1.0001988 1.0000465 1.0001619
## chlorides free.sulfur.dioxide sulphates
## 0.9999573 1.0001568 0.9998802
## sigma
## 1.0001435
Draws within a Markov chain are not independent, so some autocorrelation likely remains for (Intercept), pH, alcohol, density, residual.sugar, and fixed.acidity, whose effective sample size ratios are below about 0.55. For the remaining parameters the ratios are near 1, and all R-hat values are very close to 1, indicating the chains are stable, mixing well, and behaving much like an independent sample.
posterior_interval(Wine_model_1, prob = 0.90)
## 5% 95%
## (Intercept) 35.833292665 75.225785373
## pH 0.287558173 0.587605857
## alcohol 0.239469996 0.294959846
## density -74.832789043 -34.657289390
## citric.acid -0.242169680 0.021013613
## residual.sugar 0.034960288 0.052016930
## total.sulfur.dioxide -0.002929548 -0.002016838
## volatile.acidity -1.457327157 -1.199869362
## fixed.acidity 0.041804717 0.093017098
## chlorides -1.026374216 0.066412095
## free.sulfur.dioxide 0.004708779 0.007211581
## sulphates 0.640190250 0.895576908
## sigma 0.724776562 0.746055832
pp_check(Wine_model_1)
The pp_check() plot shows that the posterior predictive draws are smooth and roughly normal, while the observed values are uneven because quality is an integer score. With the outcome in its raw discrete form, the overlay is hard to interpret, so we cannot derive a concrete conclusion from this graphical comparison of observed and predicted values.
output<-tidy(Wine_model_1, effects=c("fixed","aux"),
conf.int=TRUE,
conf.level=0.95)
output
d0 = which(output$conf.low < 0)
d1 = which(output$conf.high > 0)
output$term[intersect(d0, d1)]
## [1] "citric.acid" "chlorides"
From the tidy() summary it is clear that citric.acid and chlorides have confidence intervals that include zero, indicating quality has no clear association with “citric.acid” or “chlorides”. Among the remaining predictors, those with a positive relation to the quality of the wine are pH, alcohol, residual.sugar, fixed.acidity, free.sulfur.dioxide, and sulphates, and those with a negative relation are density, total.sulfur.dioxide, and volatile.acidity. Across the whole model the confidence intervals are very narrow and the standard errors very low. Sizable estimates are seen only for density, volatile.acidity, pH, alcohol, and sulphates. We can also observe that the mean_PPD is close to the observed mean of quality, suggesting the model derived from these variables fits the data well.
Sulphates has a sizable estimate (about 0.77), and its confidence interval does not contain zero, showing a positive association between wine quality and sulphates, so we reject the null hypothesis.
The level of alcohol has a sizable estimate (about 0.26), and its confidence interval does not contain zero, showing a positive association between wine quality and the level of alcohol, so we reject the null hypothesis.
Volatile acidity has a sizable estimate (about −1.32), and its confidence interval does not contain zero, showing a negative association between wine quality and volatile acidity, so we reject the null hypothesis.
This supports the statements in the reference material about the associations between the chosen predictors and the outcome variable.
set.seed(2002)
predictions <- posterior_predict(Wine_model_1, newdata = project1)
dim(predictions)
## [1] 20000 6497
ppc_intervals(project1$quality,
yrep = predictions,
prob = 0.5,
prob_outer = 0.95)
Here the x-axis is the data point index, and the observed values generally fall within the predicted intervals. For quality scores between 4 and 8, most observed values are encompassed by the prediction intervals, while at the extremes (quality 3 and 9) we see some discrepant values. Despite these outliers, the model is a good predictor for most observations and a poor one for only a few.
set.seed(2002)
prediction_summary(Wine_model_1, data = project1)
About 94.18% of the observed values fall within their 95% prediction intervals, indicating the model is a good one.
Wine_model_2 <- stan_glm(
quality ~ density+volatile.acidity+pH+alcohol+sulphates,
data = project1, family = gaussian,
prior_intercept = normal(6, 1.5, autoscale = TRUE),
prior = normal(0,1, autoscale = TRUE),
prior_aux = exponential(1, autoscale = TRUE),
chains = 4, iter = 5000*2, seed = 2002)
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000424 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.24 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.399222 seconds (Warm-up)
## Chain 1: 2.65273 seconds (Sampling)
## Chain 1: 3.05195 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.6e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 2: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 2: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 2: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 2: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 2: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 2: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 2: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 2: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 2: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 2: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 2: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.379004 seconds (Warm-up)
## Chain 2: 2.724 seconds (Sampling)
## Chain 2: 3.103 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 2.7e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.27 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 3: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 3: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 3: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 3: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 3: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 3: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 3: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 3: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 3: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 3: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 3: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.37344 seconds (Warm-up)
## Chain 3: 2.71407 seconds (Sampling)
## Chain 3: 3.08751 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 1.8e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 10000 [ 0%] (Warmup)
## Chain 4: Iteration: 1000 / 10000 [ 10%] (Warmup)
## Chain 4: Iteration: 2000 / 10000 [ 20%] (Warmup)
## Chain 4: Iteration: 3000 / 10000 [ 30%] (Warmup)
## Chain 4: Iteration: 4000 / 10000 [ 40%] (Warmup)
## Chain 4: Iteration: 5000 / 10000 [ 50%] (Warmup)
## Chain 4: Iteration: 5001 / 10000 [ 50%] (Sampling)
## Chain 4: Iteration: 6000 / 10000 [ 60%] (Sampling)
## Chain 4: Iteration: 7000 / 10000 [ 70%] (Sampling)
## Chain 4: Iteration: 8000 / 10000 [ 80%] (Sampling)
## Chain 4: Iteration: 9000 / 10000 [ 90%] (Sampling)
## Chain 4: Iteration: 10000 / 10000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.369657 seconds (Warm-up)
## Chain 4: 2.68923 seconds (Sampling)
## Chain 4: 3.05889 seconds (Total)
## Chain 4:
summary(Wine_model_2 )
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: quality ~ density + volatile.acidity + pH + alcohol + sulphates
## algorithm: sampling
## sample: 20000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 6497
## predictors: 6
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -25.2 4.8 -31.3 -25.2 -19.0
## density 27.3 4.8 21.2 27.3 33.4
## volatile.acidity -1.6 0.1 -1.6 -1.6 -1.5
## pH 0.1 0.1 0.0 0.1 0.2
## alcohol 0.4 0.0 0.3 0.4 0.4
## sulphates 0.5 0.1 0.4 0.5 0.5
## sigma 0.7 0.0 0.7 0.7 0.8
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 5.8 0.0 5.8 5.8 5.8
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.0 1.0 13651
## density 0.0 1.0 13707
## volatile.acidity 0.0 1.0 18586
## pH 0.0 1.0 19308
## alcohol 0.0 1.0 14692
## sulphates 0.0 1.0 17099
## sigma 0.0 1.0 20438
## mean_PPD 0.0 1.0 19801
## log-posterior 0.0 1.0 8988
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
As with the full model, the mean_PPD matches the sample mean of quality (5.8). The mcse is essentially zero for all parameters and Rhat ≈ 1, indicating the chains have converged and the posterior has been sampled well, so the results can be considered reliable. The effective sample sizes are large (roughly 9,000 to 20,000), indicating we have explored the probability space enough to have a good representation.
prior_summary(Wine_model_2)
## Priors for model 'Wine_model_2'
## ------
## Intercept (after predictors centered)
## Specified prior:
## ~ normal(location = 6, scale = 1.5)
## Adjusted prior:
## ~ normal(location = 6, scale = 1.3)
##
## Coefficients
## Specified prior:
## ~ normal(location = [0,0,0,...], scale = [1,1,1,...])
## Adjusted prior:
## ~ normal(location = [0,0,0,...], scale = [291.21, 5.30, 5.43,...])
##
## Auxiliary (sigma)
## Specified prior:
## ~ exponential(rate = 1)
## Adjusted prior:
## ~ exponential(rate = 1.1)
## ------
## See help('prior_summary.stanreg') for more details
mcmc_trace(Wine_model_2, size = 0.1)
After applying mcmc_trace(), we can see that all of the chains look fairly stable.
mcmc_dens_overlay(Wine_model_2)
From these density overlay plots we can see that even across the four parallel chains the posterior densities are close to each other.
mcmc_acf(Wine_model_2)
The graphs show the degree to which successive simulation values are related to each other. Autocorrelation is high at small lags but drops quickly toward zero at larger lags, which suggests the results are credible.
neff_ratio(Wine_model_2)
## (Intercept) density volatile.acidity pH
## 0.68255 0.68535 0.92930 0.96540
## alcohol sulphates sigma
## 0.73460 0.85495 1.02190
rhat(Wine_model_2)
## (Intercept) density volatile.acidity pH
## 1.000483 1.000493 1.000024 1.000077
## alcohol sulphates sigma
## 1.000277 1.000084 1.000135
The effective sample size ratios are slightly below 1 and the R-hat values are very close to 1, indicating that the chains are stable, mixing quickly, and behaving much like an independent sample.
posterior_interval(Wine_model_2, prob = 0.90)
## 5% 95%
## (Intercept) -33.16140287 -17.3595712
## density 19.53032858 35.1714546
## volatile.acidity -1.67064357 -1.4652559
## pH 0.01117182 0.2118154
## alcohol 0.34422684 0.3811096
## sulphates 0.34947279 0.5730320
## sigma 0.73420049 0.7554154
pp_check(Wine_model_2)
As before, the pp_check() plot shows that the posterior predictive draws are smooth and roughly normal, while the observed values are uneven because quality is an integer score. With the outcome in its raw discrete form, the overlay is hard to interpret, so we cannot derive a concrete conclusion from this graphical comparison of observed and predicted values.
output1<-tidy(Wine_model_2, effects=c("fixed","aux"),
conf.int=TRUE,
conf.level=0.95)
output1
d2 = which(output1$conf.low < 0)
d3 = which(output1$conf.high > 0)
output1$term[intersect(d2, d3)]
## [1] "pH"
From the tidy() summary it is clear that the confidence interval for pH includes zero, which indicates that quality has no clear association with pH. Among the remaining predictors, alcohol, density, and sulphates have a positive relation with wine quality, while volatile.acidity has a negative relation. Across the whole model the confidence intervals are very narrow and the standard errors are very low. Sizeable estimates are seen only for density and volatile.acidity. We can also observe that the posterior predictive density is positive and fairly high, indicating that the model derived from the given variables is a good fit.
Even here, the associations of sulphates, volatile acidity, and alcohol level did not change, so the null hypothesis of no association can be rejected for them.
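The interval check used above can be sketched on a toy data frame: a term is flagged when its interval straddles zero, i.e. it has both a negative lower bound and a positive upper bound. The interval values here are hypothetical stand-ins, not the fitted 95% intervals from Wine_model_2.

```r
# Hypothetical interval bounds, for illustration only
toy <- data.frame(
  term      = c("density", "volatile.acidity", "pH"),
  conf.low  = c(19.53, -1.67, -0.02),
  conf.high = c(35.17, -1.47,  0.25)
)

# Same logic as intersect(which(conf.low < 0), which(conf.high > 0))
straddles_zero <- toy$conf.low < 0 & toy$conf.high > 0
toy$term[straddles_zero]
```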
set.seed(2002)
predictions1 <- posterior_predict(Wine_model_2, newdata = project1)
dim(predictions1)
## [1] 20000 6497
ppc_intervals(project1$quality,
yrep = predictions1,
prob = 0.5,
prob_outer = 0.95)
Here the x axis is the data point index, and we can observe that the predicted values fall within a certain range and the observed values mostly fall inside that predicted range. Across the quality scale on the y axis, from 4 up to 8, most of the observed values are encompassed by the predicted intervals, while towards the lower end of the spectrum there are some discrepant values. A few outliers are visible, but overall our model is a good predictor for most of the values and a poor predictor for only a few.
set.seed(2002)
prediction_summary(Wine_model_2, data = project1)
Almost 94.10% of the predicted values fall within the 95% interval, indicating the model is a good one compared to Wine_model_1: it is reduced, uses only selected predictors, and covers almost as much as Wine_model_1. Since a lower MAE means a better fit, note that the MAE for Wine_model_2 is slightly higher than for Wine_model_1.
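The two quantities behind prediction_summary() can be sketched by hand on simulated data (not the wine fit): the MAE of the posterior predictive medians, and the share of observations falling inside their 95% posterior predictive intervals.

```r
set.seed(2002)
n_obs <- 200
mu    <- rnorm(n_obs, mean = 6, sd = 0.5)   # stand-in model predictions
y     <- rnorm(n_obs, mean = mu, sd = 0.8)  # stand-in observed outcomes
yrep  <- sapply(mu, function(m) rnorm(2000, m, 0.8))  # draws x observations

# MAE: average distance between each observation and its predictive median
pred_median <- apply(yrep, 2, median)
mae <- mean(abs(y - pred_median))

# Coverage: share of observations inside their 95% predictive interval
lower    <- apply(yrep, 2, quantile, probs = 0.025)
upper    <- apply(yrep, 2, quantile, probs = 0.975)
coverage <- mean(y >= lower & y <= upper)

c(mae = mae, within_95 = coverage)   # coverage should be near 0.95
```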
set.seed(84735)
loo_1 <- loo(Wine_model_1)
loo_2 <- loo(Wine_model_2)
c(loo_1$estimates[1], loo_2$estimates[1])
## [1] -7230.037 -7307.900
loo_compare(loo_1, loo_2)
## elpd_diff se_diff
## Wine_model_1 0.0 0.0
## Wine_model_2 -77.9 14.4
loo_1$estimates
## Estimate SE
## elpd_loo -7230.03746 71.11766
## p_loo 15.86357 1.46345
## looic 14460.07493 142.23531
loo_2$estimates
## Estimate SE
## elpd_loo -7307.899774 69.5345745
## p_loo 7.941101 0.2922405
## looic 14615.799548 139.0691491
As we know, the higher the ELPD estimate, the better the model fit. Here Wine_model_1 has a higher ELPD (-7230.04) than Wine_model_2 (-7307.90). The difference is modest, but an ELPD difference of zero is not within two standard errors of the estimated difference: -77.9 ± 2*(14.4) = (-106.7, -49.1). As both the ELPD estimates and the difference point the same way, Wine_model_1 is a better fit than Wine_model_2.
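The two-standard-error rule applied to the loo_compare() output above can be written out directly: if zero lies outside elpd_diff ± 2 * se_diff, the difference between the models is notable.

```r
# Values from the loo_compare() output above
elpd_diff <- -77.9
se_diff   <- 14.4

interval <- c(elpd_diff - 2 * se_diff, elpd_diff + 2 * se_diff)
interval                                    # (-106.7, -49.1)

zero_inside <- interval[1] <= 0 && 0 <= interval[2]
zero_inside                                 # FALSE: the difference is notable
```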
The research questions: a) Is there a significant association between the sulphates in the wine and the quality of the wine?
b) Is there a significant association between the level of alcohol and the quality of the wine?
c) Is there a significant association between the volatile acidity and the quality of the wine?
As we can see from the tidy() summary in the section above, sulphates, alcohol level, and volatile acidity are all significant when predicting the quality of the wine: their estimates are sizeable and their confidence intervals do not contain zero, demonstrating statistical significance.
During the analysis, citric.acid was identified as a predictor with no association with wine quality, but according to Cortez et al. (2009) citric acid levels are important in white wine to maintain an equilibrium between freshness and sweet taste. Since the two datasets were merged, I think this produced an aggregate outcome that leaves scope for misrepresentation, sometimes leading to false conclusions when red and white wine are considered separately. Not only the citric acid level but also the pH of the wine appeared to have no association with quality after the analysis, which might not be true. So separation by color is needed for a proper analysis of the dataset. Hence I created a separate column called color for further, deeper analysis of the data.
An additional variable called Color is needed to do further research on the dataset. As the data is already present in a public repository, no other approach is needed to collect the data.
Yes, one of the conclusions I drew from the reduced model's analysis is that pH has no association with wine quality, but Boulton et al. (1996) say otherwise: pH influences microbiological stability, affects the equilibrium of tartrate salts, determines the effectiveness of sulfur dioxide and enzyme additions, influences the solubility of proteins and the effectiveness of bentonite, and affects red wine colour and oxidative and browning reactions. Another conclusion was that citric.acid has no association with wine quality, but according to Cortez et al. (2009) citric acid levels are important in white wine to maintain an equilibrium between freshness and sweet taste.
From the above analysis I have concluded that red and white wines each have their own method of preparation and hence their own set of predictor values. The analysis here gave a combined understanding of both; next time, adding the color attribute to the merged dataset will likely yield quite different results from what we have seen here.
When first searching for a dataset for the final project, I felt one should be clear about the topics in Bayes Rules!; otherwise one would not know which dataset to choose, for lack of knowledge of where and how to look and how to leverage a dataset to one's use. I made exactly that mistake: I picked a model and tried to search for datasets to fit it, which was tedious. After a while I understood the concepts, and then the use case of the dataset became clearer. The choice of research questions was not random: after going through the PDF available with the dataset, I found some statements made without proof, so I wanted to check their credibility and proceeded with the analysis. I did not face much difficulty regarding the choice of research questions, but since I merged two datasets, the red and white wine data were mixed together, and it became a challenge to know which rows belonged to which color of wine.
As I mentioned before, once I understood the concepts of Bayes Rules!, the use case of the datasets became clearer, and I was able to identify which model segment a dataset falls into and how it can be analysed, along with the choice of variables. For the color challenge, I created a separate column called color so that, even though the data is mixed, each row is tagged with the color of the wine, leaving no confusion.
There has been a drastic change in my learning curve. Previously I had no knowledge of model priors, but here I have learnt how prior information can be leveraged to predict the posterior. Instead of lm and glm, I have learnt why we use the stan_glm() function to build models with priors. I have also learnt how to leverage loo_compare() to find a better-fitting model, whereas previously I would use only an ANOVA test for that.
## REFERENCES
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine/
Boulton, R. B., Singleton, V. L., Bisson, L. F., & Kunkee, R. E. (1996). Juice and wine acidity. In Principles and practices of winemaking (pp. 521–253). New York: Chapman & Hall.
Johnson, A. A., Ott, M. Q., & Dogucu, M. (2021, December 1).Bayes Rules! An Introduction to Applied Bayesian Modeling. Bayes Rules. https://www.bayesrulesbook.com/